No ground truth? No problem: Improving administrative data linking using active learning and a little bit of guile

While linking records across large administrative datasets [“big data”] has the potential to revolutionize empirical social science research, many administrative data files do not have common identifiers and are thus not designed to be linked to others. To address this problem, researchers have developed probabilistic record linkage algorithms which use statistical patterns in identifying characteristics to perform linking tasks. Naturally, the accuracy of a candidate linking algorithm can be substantially improved when an algorithm has access to “ground-truth” examples—matches which can be validated using institutional knowledge or auxiliary data. Unfortunately, the cost of obtaining these examples is typically high, often requiring a researcher to manually review pairs of records in order to make an informed judgement about whether they are a match. When a pool of ground-truth information is unavailable, researchers can use “active learning” algorithms for linking, which ask the user to provide ground-truth information for select candidate pairs. In this paper, we investigate the value of providing ground-truth examples via active learning for linking performance. We confirm popular intuition that data linking can be dramatically improved with the availability of ground truth examples. But critically, in many real-world applications, only a relatively small number of tactically-selected ground-truth examples are needed to obtain most of the achievable gains. With a modest investment in ground truth, researchers can approximate the performance of a supervised learning algorithm that has access to a large database of ground truth examples using a readily available off-the-shelf tool.


Introduction
One of the more exciting recent developments in empirical social science research is the increasing availability of large administrative databases and the ability to link across them to crowdsourcing [26], this approach is not applicable to all kinds of datasets and research contexts. An alternative approach to data linking is to use an active learning algorithm. Like supervised learning, the goal is to generate ground truth labels upon which the linking algorithm can be trained. However, the approach is far more surgical. Whereas supervised learning requires that the research team have access to a relatively large ground truth database, active learning algorithms request labels from the researcher for high leverage cases and learn from a much smaller number of ground truth labels [27][28][29]. These approaches typically work by presenting the human labeler with examples to label, starting with the example whose label would yield the most information. The researcher can label as many or as few examples as they want, using either their own time constraints or the diminished utility of labeling an additional example as a stopping criteria. While this approach is thought to be an improvement over unsupervised linking methods that do not use ground truth at all, the consequences on linking quality have not been rigorously analyzed. In particular, we may be concerned that the number of ground-truth examples needed to maintain linking performance below a particular maximal error rate might grow as a function of the size of the data sets and the relative frequency of mismatches between the data sets involved. Simply put, the concern is that the potential benefits of an active learning approach are necessarily limited by the paucity of ground truth data that active learning algorithms can access.
In this paper, we investigate the value of ground-truth examples for linking performance. Using a popular open-source record linkage algorithm, dedupe [30], we explore how the payoff to investing in ground truth labels varies based on linking context, such as the difficulty of the linking task, the size of the training dataset, and the size of the datasets involved in the linking task itself. Specifically, we fix the share of true matches that do not match exactly on name and date of birth at 50 percent (which is likely at the "messier" end of data sets we have observed in the field) and vary the following parameters to explore the responsiveness of the linking error rate to the difficulty of the matching task: the size of data set that was used to train an algorithm and the size of the pool of potential records.
Our principle contributions are as follows: 1. We provide a roadmap for applied social science researchers-especially those who are unfamiliar with the technical data science literature on record linking-to perform data linking efficiently when ground truth information is unavailable.
2. We confirm the common sense finding that data linking can be dramatically improved with the availability ground truth examples. But critically, we note that in many applications, only a relatively small number of tactically-selected ground-truth examples are sufficient to obtain most of the achievable gains.
3. We show, that in the context of a common linking task, with a modest investment in obtaining ground truth examples, researchers can quickly surpass the performance of an unsupervised algorithm and, instead, approximate the performance of a sophisticated supervised learning algorithm that has access to a large database of ground truth examples using a simple and readily available off-the-shelf linking tool.

Data and methods
In the present study, we have ground-truth information on true matches, which allows us to report true error rates in order to assess linking performance. Since there may be certain true matches that will be very difficult for any algorithm to find (e.g. if the identifying information is completely different across records), we also estimate a proxy for the "achievable" error rate via an algorithm that has access to all ground-truth examples. To do so, we employ the Name Match Python package and compare the error rate from Name Match to the rates from dedupe. We establish a performance baseline using a popular unsupervised record linkage tool called fastLink.

Data
Our study protocol was reviewed by the University of Oregon Institutional Review Board (Protocol Number 09052018.005) and was determined to qualify for exemption as per the Common Rule regulations found at Title 45 CFR 46.101(b)(4). This review determined that no informed consent process was applicable for the current study. Our test linking data is a set of identified administrative records on three million charges filed in Oregon courts between 1990-2012, maintained in the Oregon Judicial Information Network (OJIN). The individual records in the OJIN data contain a number of relevant data points: name, date of birth, race, and gender of the defendant; the date the case was filed, a description of the charge, and a unique identification number that links the same defendants across rows in the data set.
Around 77% of the sample is White, while Hispanic and Black individuals each makes up around 10% of the sample. Nearly 80% of the individuals are male, and the average age is around 32.6 years (SD = 11.1 years). We link the data using first name, middle initial, last name, date of birth, race, gender and age. We use the unique identifier as our indicator of ground truth to assess whether the algorithms are able to correctly identify links at the person level. Among the records that we can identify as the same person, because they share a unique identifier, 63.7 percent match exactly on name, 90.7 percent match exactly on date of birth, and 59.7 percent match exactly on both name and date of birth (see S1 Appendix for details about how far off non-exact matches are in the data). We collapse the file to the case level, because there is no variation in a person's identifying information across charges within a case. This leaves us with 1.3 million cases.

The linking task
The record linkage task we are simulating using these data is a link between a small data set from a research experiment, which we'll call E (one record per person), and a larger administrative data set, D (potentially multiple records per person). Our set up mirrors a common linking task in a number of settings, such as linking the names of youth employment program participants to the list of arrestees to examine the crime-reduction effect of employment, or linking the list of election candidates to the list of corporate directors to study if politically connected business had easier access to loans [31,32]. We further subdivide E and D into training E t and D t and evaluation E e and D e portions, since we do not want to evaluate results of a learning algorithm on in-sample data. We also ensure that there is no overlap between the training and evaluation data sets. In the specifications to follow, the size of the training and evaluation sets are the same. Experimental data sets E t and E e are always set to 1,000 unique individuals. The administrative data sets vary between 5,000, 50,000, and 200,000 records, where D t and D e are the same number of records in each specification. In other words, in every specification, we can think of our objective as attempting to link a person-level experimental data set of 1,000 observations to a case-level data set of either 5,000, 50,000 or 200,000 records. Finally, we hold the distribution of records per person constant at values reasonably similar to the full data set's natural distribution.
The distribution of records per person in the administrative data D is as Table 1 shows: We fix the true match rate, meaning the rate of individuals in the experimental data E who have a true match in the administrative data at 50% to facilitate performance evaluation. In other words, 50% of the people in the experimental data set have one or more records in the administrative data set. If our linking algorithm performs perfectly, we would expect to find a link in D for the 500 people from E that have one. Among these true matches, the share of matches where the records in the experimental data E and the administrative data D match exactly on first name, last name and date of birth is approximately 50% for both the training and evaluation sets. Meaning that of the 1,000 individuals in E, 500 have a record in D and 250 of those have a record in D that matches the information in E exactly on first name, last name, and date of birth.
This particular type of linking task is becoming increasingly common in the social sciences, particularly as it relates to the proliferation of the "low-cost" randomized controlled trial, in which a data set of participants from an intervention is linked to existing administrative databases to capture outcome data in one or more domains [33].

Running dedupe
The dedupe tool can perform linking tasks using either a traditional record linkage framework, which only considers links between the administrative database D and the experimental data E or a de-duplication linkage framework in which the records in the experimental data E are appended to the administrative data set D making one large database with the number of records equal to the sum of the records from E and D. From there, links are identified via deduplication meaning that, unlike the record linkage framework, the process considers links within the administrative database D and also within the experimental data E. To streamline the presentation of results here, we present results for the de-duplication linkage framework, because it tends to have better performance in this context than the traditional record linkage framework. Although the patterns we discuss hold for both kinds of linkage frameworks (see S1 for the traditional record linkage framework results).
Evaluating all possible pairs of records as potential links is often too computationally intensive to be feasible in real-world linking tasks. As a result, record linkage processes begin with a blocking step designed to reduce the number of comparisons required during linking by eliminating those that have low likelihood of being true matches. During blocking, records are organized into groups that consist of broadly similar records according to particular criteria. For example, blocking on birth year would mean that only records with the same birth year will be considered as potential links. The goal is to choose blocking criteria to discard as many nonlinks as possible without discarding true links. Once the data is organized into blocks, the linking task proceeds by assessing pairs of records within blocks to find potential links. This restriction of the comparison space can greatly reduce runtime of the linkage task, but often forces the researcher to make an a priori decision about which blocking rules to impose. Dedupe's blocking step is part of the active learning process. The optimal blocking rules are part of what is learned via the active learning process, with dedupe learning blocking rules to maximize the share of true pairs that make it past the blocking step. Accordingly there are no a priori decisions that the user has to make about what defines a block before the linking process begins.
This active learning component, which typically requires users to label some number of pairs as either a match, non-match, or unknown, is a critical component of dedupe's linking process. As previously noted, active learning algorithms work by identifying unlabeled potential links that are likely to provide the most information for the training procedure. Dedupe's active learning step starts by creating four different types of queues, in decreasing order of expected gain in predictive information.
1. Type A: Record pairs that are not blocked together, but have high link probabilities (greater than 0.5) 2. Type B: Record pairs that are blocked together, but have low link probabilities (less than 0.5) 3. Type C: Record pairs that are blocked together and have high link probabilities 4. Type D: Record pairs that are not blocked together and have low link probabilities At each step, dedupe presents the human labeler with a sample from the non-empty queue with the highest priority. For example, if there are no record pairs that meet the criteria for the first queue, then a record pair from the second queue is sampled for human labeling. Predicted probabilities are generated from a regularized linear regression model using the instances labeled up to that point as the outcome. The flowchart below documents dedupe's process (Fig 1). We simulate human input by feeding dedupe the labels it requests from ground truth data in order to ensure consistency. By automating the active learning process using ground truth data, we are evaluating dedupe's performance under optimal conditions, in which there are no errors in the labeling process. In other words, we are assessing dedupe's performance under the condition that what it learns from the labeling process is entirely correct. The active learning component of dedupe asks the user to label training pairs until it reaches a minimum of 10 positive and 10 negative labels. While dedupe does not limit the training process, the website notes that it should be rare to need more than 50 positive and 50 negative cases to train the algorithm on even the most difficult and messy data sets [34]. There are parameters in dedupe that control the set of pairs that are available during the active learning step of the linking process. The "sample size" parameter determines the number of label-eligible record pairs and the "blocked proportion" parameter influences what share of label-eligible record pairs are similar pairs vs. random pairs. We will show that in the context of our linking task, performance was highly sensitive to the choice of the sample-size parameter. In contrast, our results were not sensitive to variations in blocked proportion. As a result, we fixed blocked proportion at the deduplication framework's default value of 0.90. A value of 0.90 means that 90% of all the record pairs that the user could be asked to label are records that were similar enough to appear in the same block at least once. The remaining 10% are random pairs of records that, most likely, look nothing alike. Finally, we also make a minor adjustment to the train function. The "recall" parameter is set to 1 by default, meaning that the algorithm must find blocking rules such that 100% of labeled matches are blocked together. We adjusted the recall parameter to.99, because occasionally there are real world cases where an individual actually has two very different looking records. We run dedupe four times for each specification of our linking task in order to observe any variation in linking performance that may occur due to random sampling performed by the algorithm. The version of dedupe used in this paper is v2.0.6, with minor modifications to allow for automated labeling.

Implementation platform
We ran dedupe with access to 8 Intel Xeon Platinum 8180 CPUs and 700 gigabytes of RAM. One of the benefits of linking using dedupe is that it can be run on a personal laptop and does not require specialized hardware. While we did not conduct an extensive run-time analysis, the relatively short duration of even the longest dedupe run (an average of 134 minutes for a specification with a budget of 1,000, an administrative dataset size of 200,000, and a sample size setting of 150,000), suggests that dedupe's runtime is a very reasonable investment for the purposes of constructing an analytic data set.

Benchmarking using a supervised learning algorithm
Since administrative data linking is a noisy task under the best conditions, solely comparing dedupe's performance to the true link rate is a stringent standard. Accordingly, we also conducted a benchmarking exercise so that we had a point of comparison for dedupe's performance that was more indicative of what might be possible for a linking algorithm that had access to all the ground truth information, which is likely the best possible performance under real world conditions. To establish this best-case-scenario benchmark, we used a supervised record linkage algorithm, Name Match, to link each combination of experimental dataset E and administrative dataset D. Name Match was given access to the same data fields dedupe used, plus all of the ground truth information available. We used the Name Match tool's default settings, including the default blocking scheme which considers two records a potential link if their birthdays are within 2 character edits of each other and their names are reasonably similar. Results are based on Name Match version 1.1.0.

Benchmarking using an unsupervised learning algorithm
To demonstrate the performance gains achievable through modest investment in generating ground truth labels, we also establish a performance baseline using one of the most popular unsupervised record linkage tools, fastLink. This allows us to understand expected linking performance in a circumstance where none of the ground truth information is available. Specifically, we used fastLink to link each combination of experimental dataset E and administrative dataset D. We gave fastLink access to the same seven data fields dedupe and Name Match used, but none of the ground truth information. When using fastLink, we chose a blocking scheme such that two records were considered a potential link if their birth years were within one year (window.block) and their last names were loosely similar (kmeans.block with 2 clusters). This is a reasonable blocking scheme and closely resembles the blocking scheme used in Name Match. Otherwise, all of fastLink's parameters were left to default values. Results are based on fastLink version 0.6.0.

Evaluation metrics
There are a number of candidate performance metrics we can use to assess the performance of record linking tools. We focus on three evaluation metrics: precision, recall, and total error. Precision and recall are commonly used to evaluate the performance of information retrieval and prediction tasks, because they each focus on a different type of error. In the context of data linkage, high precision indicates fewer false positives. In other words, the algorithm is not linking records that belong to two different individuals. Whereas, recall focuses on false negatives. High recall means that the algorithm is not missing the true links that do, in fact, belong to the same individual. While, ideally, an algorithm has both-high precision and high recall-there is an inherent trade off between the two: fewer false positives typically results in more false negatives, and vice versa. In some cases, the relative costs of false positive and false negative links varies, so it is useful to assess these two measures separately. The total error metric-which is a combination of the false positive rate (FPR) and the false negative rate (FNR)-is also key because the bias introduced by linkage errors is a function of the total linking error [11]. Naturally, to minimize the bias introduced by the linking process, we want to minimize the total error metric.

Off the shelf performance
We begin by assessing dedupe's "off the shelf" performance in the task of linking the experimental data set, E, to administrative data sets, D, of varying sizes. Fig 2 presents the results of this exercise using dedupe's deduplication framework to link E to D. We perform the task with dedupe's default settings: the "sample size" parameter set to 1,500 label-eligible pairs and blocked proportion set to 0.90. We compare dedupe's performance to that of an unsupervised learning algorithm (fastLink) and a supervised learning algorithm (Name Match).
In each of the plots, the dashed peach line represents the performance of the unsupervised learning algorithm (fastLink) and the solid black line represents the performance of the supervised learning algorithm (Name Match). In order to generate confidence intervals around algorithm performance, we developed a bootstrapped inference procedure (described in detail in S1 Appendix), the shaded areas on the plots represent the bootstrapped 95% confidence interval. Each of the points represents a separate run of dedupe. (Note that some of the runs had nearly identical performance, such that they are not individually distinguishable in the figure.) The plots in Columns 2 (n = 50,000) & 3 (n = 200,000) of Fig 2 had extraordinarily high total error rates, much higher than the "baseline" performance of the unsupervised learning algorithm and in some cases approaching 100 percent. For example, the best performance for 50,000 admin rows was 33 percent total linking error even after feeding the algorithm 1,000 accurate ground truth labels. The results in Column 3 reveal even larger error rates. Here the best total error rate was 54 percent-for the task of linking an experimental data E of 1,000

PLOS ONE
Improving administrative data linking using active learning individuals to an administrative data set D containing 200,000 rows with 1,000 accurate ground truth labels. This is a concerning result given that many administrative data sets that might be targeted for linking tasks are quite large. In order to investigate the reason for such high error rates in Fig 2, we next examine the sensitivity of the results to varying each of dedupe's hyperparameters (i.e., the set of adjustable parameters).

"Sample size" implications
We found that linking performance using dedupe is highly sensitive to the "sample size" parameter as the size of the administrative data set D increases. In other words, for larger administrative data set sizes, dedupe needs a larger pool of candidate pairs from which it can select pairs to be labeled during the active learning component of the linking process. Fig 3  shows the performance of the linking task for "sample sizes" of 1,500 (default), 15,000, 150,000 and 300,000 candidate pairs. With respect to performance, all of the sample sizes performed similarly with respect to precision (Fig 3, Panel B). While precision performance is not very sensitive to sample size, increasing the number of candidate pairs to which dedupe has access when it is looking for candidates to label has large effects on recall performance (Fig 3, Panel  C). As Panel C, Column 3 shows, recall never exceeded 40 percent on average even when we provided accurate labels for 1,000 of the 1,500 candidate pairs. By contrast, for all three specifications, a sample size of 300,000 candidate pairs performed well on both precision and recall and, by extension, total error. The "sample size" induced variation in dedupe's performance is The median column compares the performance of the median runs of each algorithm. The worst-case column compares the best run of default dedupe to the worst run of "sample size"-adjusted dedupe. * Indicate significant difference in total error rate between "sample size"-adjusted dedupe and default dedupe. Share significant indicates the percentage of comparisons where "sample size"-adjusted dedupe was a significant improvement over default dedupe.
https://doi.org/10.1371/journal.pone.0283811.t002 driven by the availability of "high value" candidate pairs that, when labeled, provide the most predictive information for the algorithm. These "high value" candidate record pairs are relatively rare, and therefore increasing the sample size parameter allows dedupe's active learning step access to more "high value" pairs to present for labeling (see S1 Appendix for a detailed analysis of the types of candidate pairs selected for labeling at various sample sizes). We performed a bootstrap inference procedure to test whether adjusting the sample size parameter has a significantly better performance than default dedupe (see S1 Appendix for details). For each bootstrap sample and performance metric, we take the difference between the performance metric values for "sample size"-adjusted dedupe and default dedupe. We perform this process for each linking specification considered, i.e. size of dataset D and the label budget provided to dedupe. Thus, for each linking specification, we have a bootstrap distribution of the difference in performance metrics between adjusted and default dedupe. We consider the performance of "sample size"-adjusted dedupe to be significantly better than default dedupe if the difference in performance is less than zero at the 95th percentile of the empirical distribution for total error. We find that changing dedupe's sample size parameter to 150,000 potential pairs results in significantly better performance than the default version of dedupe in the majority of comparisons. In Table 2) the median column compares the total error rates for the median run of "sample size"-adjusted dedupe and default dedupe. The worst-case column compares the best run of default dedupe to the worst run of sample size adjusted dedupe. The asterisks in the table indicate that the difference in performance between "sample size"-adjusted dedupe and default dedupe is less than zero at the 95th percentile of the empirical distribution of differences in total error rate. The table also presents the percentage of comparisons for which "sample size"-adjusted dedupe was a significant improvement over default dedupe. The benefit to increasing the sample size parameter is most notable when the administrative database D is relatively large. When D has 50,000 rows, we see significant improvements when we adjust the sample size parameter, where the number of labels is 80 or more. In that case adjusting the sample size results in a significantly lower total error rate in 60-100% of possible comparisons depending on the number of labels that are used. The benefits to adjusting dedupe's sample size parameter are even more dramatic when the size of the administrative database D is 200,000 rows. In those tests, the "sample size"-adjusted version of dedupe performed significantly better than default dedupe (with respect to total error) for between 80-100% of possible comparisons. However, there are costs to the speed of the process by substantially increasing sample size. To balance run time costs and performance, we shifted to a sample size of 150,000 for all specifications.  Fig 2, indicating that the runs were much more likely to meet the baseline training threshold of 10 positive and 10 negative links during the labeling process when dedupe is given access to a larger pool of candidate links during the labeling process. However, we also see that even conditional on meeting the training criterion of 10 positive and 10 negative labels dedupe's performance improves when we provide more ground truth labels up to a point when it starts to flatten out around 200 labels. We see from the figure that there are diminishing returns to providing ground truth labels after approximately 200 labels. We found that the performance improvement levels off after 200 labels, because after that point there are fewer and fewer "high value" candidate pairs (we provide a detailed demonstration of this relationship in the S1 Appendix). This is a key result because it means that in linking contexts when there are no pre-existing ground truth labels available, linking using active learning performs well with reasonable investments in ground truth labels. The performance improvements are largely driven by improvements in recall performance.

Main results
The result is even more remarkable put in context. Fig 4 shows the performance of the unsupervised learning algorithm, fastLink, as a baseline and the performance of the supervised learning algorithm, Name Match, as a benchmark for the linking performance we might expect under the best case scenario of access to a large amount of ground truth information. For all of the administrative data set sizes we tested, we are able to outperform the popular unsupervised learning algorithm and achieve performance similar to (or better than) a supervised learning algorithm by using dedupe's active learning algorithm and an investment in providing approximately 200 ground truth labels (or fewer in some circumstances). To formally differentiate between the performance of the algorithms, we used the same bootstrap inference procedure (described in detail in S1 Appendix). Table 3 presents the results of these comparisons. In Table 3 the median column compares the total error rates for the median run of "sample size"-adjusted dedupe to fastLink (Panel A) or Name Match (Panel B). The worstcase column compares the best run of fastLink to the worst run of "sample size"-adjusted dedupe (Panel A) or the best run of Name Match to the worst run of "sample size"-adjusted dedupe. The asterisks in the table indicate that the difference in performance between "sample size"-adjusted dedupe and fastLink or Name Match is less than zero at the 95th percentile of  Table 3. Statistical comparisons of total error rate between "sample size"-adjusted dedupe with fastLink and Name Match. the empirical distribution of differences in total error rate. The table also presents the percentage of comparisons for which "sample size"-adjusted dedupe was a significant improvement over fastLink (Panel A) and for which Name Match was a signficant improvement over "sample size"-adjusted dedupe (Panel B). In these tests, dedupe outperforms fastLink with respect to the total error rate for nearly every type of comparison tested, including all but one instance comparing dedupe's worst performance to fastLink's best performance. Indeed, for this particular linking task, in all but two scenarios (label budget = 20, D = 5, 000 & D = 50, 000), dedupe significantly outperformed fastLink in 100% of possible comparisons. We also tested whether the total error rates for dedupe and Name Match were statistically distinguishable from one another. However, unlike with the previous comparisons, where an analyst will want to be able to statistically distinguish between the performance of each of the comparisons, here, the most salient question is whether dedupe can return total error rates that are not statistically significantly different from those returned by Name Match. For administrative database sizes 5,000 and 50,000, provided dedupe has at least 80 labels, dedupe's performance is indistinguishable from Name Match's performance. When the administrative database is 50,000 rows, Name Match performed significantly better than dedupe only when we compared the worst performing run of dedupe to the best performing run of Name Match and, even then, in less than one-third of the possible comparisons when we provided dedupe with at least 200 labels. For the largest administrative database size we tested (200,000 rows) we see that although the results are not quite as dramatic, dedupe still performs to a level that is statistically indistinguishable from Name Match in more than 40% of possible comparisons when we provide at least 200 labels. This is a very encouraging result. With a few small adjustments to an open source package, researchers who have data linking needs but little specific linkage expertise can leverage active learning and a relatively small investment in providing ground truth labels to achieve remarkable performance even if their data is quite messy.
Importantly, we are not suggesting that active learning with a modest number of labels approximates the linking performance we might expect from a supervised linking process in all linking cases. However, in this specific case, we are able to approximate the performance of a supervised learning algorithm with a small number of ground truth labels. While the case we set up is specific, it is also a very common linking task in empirical social science.

Discussion
In this paper, we investigate the value of ground-truth examples for linking performance.
Using a popular open-source record linkage algorithm, dedupe [30], we explore how the payoff to investing in ground truth labels varies based on the context and the difficulty of the linking task, such as the size of the training dataset and the size of the datasets involved in the linking task itself. We show that in many applications, only a relatively small number of tacticallyselected ground-truth examples are sufficient to obtain most of the achievable gains. With a modest investment in ground truth, researchers can approximate the performance of a sophisticated supervised learning algorithm that has access to a large database of ground truth examples using simple and readily available off-the-shelf tools. Furthermore, our results show that making a modest investment in labeling yields significant gains over methods that do not leverage ground truth information at all.
Of course, it is important to discuss some of the limitations of our work. First, the number of labels required to maximize achievable performance is likely to grow with the size of the administrative dataset being linked. Our results speak to linking scenarios using databases that have fewer than 200,000 observations. Next, as we show in this paper, linkage performance is related to the difficulty of the matching task. We have explored one particularly common and important type of linking task in this paper, matching individuals when we have access to full name and date of birth information. To the extent other types of linkages are more or less difficult, that would have implications for the number of ground-truth labels needed to obtain maximum achievable performance. To that end, further avenues for research would include exploring other types of linkages: not having access to name and/or dob in full, or non-person linking such as linking location or organization names. Third, we explored how many labels are required to get back to overall levels of performance. However, there may still be differences in performance by subgroups such as by race or gender.

Conclusion
Most often, social scientists need to link data without access to ground truth or to a collaborator who can implement "state of the field" machine learning linking techniques. Although many applied research projects could benefit from implementing processes like active learning to perform record linkage tasks, there is relatively little practical guidance available. Often applied social science researchers consider record linkage to be a pre-processing step, rather than an early-stage component of their analytic plan. However, as prior work has demonstrated, record linkage errors can increase bias and diminish statistical power [11]. To that end, our paper serves as a guide for how researchers can benefit from the power of record linkage without learning data scientific methods from scratch. An added benefit is that using a publicly available linking tool like dedupe promotes transparency and replicability since the underlying code is freely available, which is particularly important for applied researchers who have little formal training in data scientific methods. Our ultimate goal is to reinforce the point that data linking is an important component of empirical analysis with implications for how social scientists understand their analytic results and to demonstrate that in many cases relatively minor investments in ground truth labeling using an open-source package can reduce the consequences of linking error and improve the validity of empirical social science research. Chalfin.